ABSTRACT


Consider the linear regression model $Y = X\theta + e$. Recently, a class of
estimators, known as ridge estimators, has been proposed as an alternative to
the least squares estimators in the case of collinearity, that is, when the
matrix $X'X$ is nearly singular. The ridge estimator is given by
$\tilde\theta = (X'X + KI)^{-1} X'Y$, where $K$ is a constant to be determined.

An optimal choice of the value of $K$ is not known. This paper examines the
risk (mean squared error) of the ridge estimator under the constraint
$\theta'\theta \le c$, where $c$ is a constant, and determines optimal values of
$K$ for which the risk is smaller than the risk of the least squares estimator.

1. Introduction. In applications of multiple linear regression, the
explanatory variables under consideration are often interrelated. The relation
is technically called multicollinearity or near multicollinearity. The ordinary
least squares estimate of the regression coefficients tends to become "unstable"
in the presence of multicollinearity. More precisely, the variances of some of
the estimated regression coefficients become large. Hoerl (1959), (1962) and
Hoerl and Kennard (1970, a), (1970, b) have suggested a class of estimators,
known as ridge estimators, as an alternative to least squares estimation in the
presence of multicollinearity. The new method of estimation is called ridge
analysis.

Ridge analysis has drawn considerable interest in recent years. The technique
has been developed and new results have been obtained by several authors, e.g.,
Hawkins (1975), Hemmerle (1975), Sidik (1975). Newhouse and Oman (1975) have
conducted a series of Monte Carlo experiments to compare the performance of
ridge analysis with that of least squares.

Ridge analysis is an ad hoc procedure which gives a biased estimator. We
compare the ridge estimator with the least squares estimator with respect to
the mean squared error. Hoerl and Kennard (1970, a) have claimed in their paper
that for a certain choice of a parameter ($K$) the ridge estimator is uniformly
superior to the least squares estimator. This is not true. It appears that the
limitations of any optimal property of the ridge estimator, and its relation to
other known estimators, are often not clearly understood by applied
statisticians engaged in regression analysis. The object of this paper is to
expose the essential features of ridge analysis.

In the following section we give a basis for the choice of the ridge estimator,
show its biased character, and compare it with an unbiased estimator.
Furthermore, we obtain conditions under which the ridge estimator has smaller
mean squared error than the least squares estimator.

2. Ridge analysis. Consider the linear regression model

$Y = X\theta + e$

where $Y$ is an $n \times 1$ vector of observations, $X$ is an $n \times p$
design matrix, $\theta$ is a $p \times 1$ vector of unknown parameters, and $e$
is an $n \times 1$ vector of observational errors. It is assumed that the
components of $e$ are uncorrelated and have a common variance equal to
$\sigma^2$, say. Also, $E(e) = 0$. Let a prime denote the transpose of a vector
or matrix. The least squares estimate of $\theta$ is obtained by minimizing
$(Y - X\theta)'(Y - X\theta)$ with respect to $\theta$, and is given by

(2.1)  $\hat\theta = (X'X)^{-1} X'Y.$

It is assumed that the columns of $X$ are linearly independent, and therefore
the rank of the matrix $X'X$ is equal to $p$.

We have $E\hat\theta = \theta$. That is, $\hat\theta$ is an unbiased estimator
of $\theta$. Let $\lambda_1, \ldots, \lambda_p$ denote the characteristic roots
of $X'X$. The mean squared error of $\hat\theta$ (MSE $\hat\theta$) is given by

(2.2)  $E(\hat\theta - \theta)'(\hat\theta - \theta) = \sigma^2 \operatorname{trace}(X'X)^{-1} = \sigma^2 \sum_{i=1}^{p} \frac{1}{\lambda_i}.$

If the explanatory variables are nearly multicollinear then the matrix $X'X$ is
ill-conditioned, that is, one (or more) of the characteristic roots of $X'X$ is
small. In that case MSE $\hat\theta$ becomes large, as is seen from (2.2). We
can avoid MSE $\hat\theta$ becoming large by inflating the characteristic roots,
that is, by substituting $\tilde\theta$ for $\hat\theta$, where

(2.3)  $\tilde\theta = (X'X + KI)^{-1} X'Y,$

$I$ is the $p \times p$ identity matrix and $K$ is a positive number. The
estimator $\tilde\theta$ is the ridge estimator, proposed by Hoerl and Kennard
as an alternative to the least squares estimator.

We have

(2.4)  $E\tilde\theta = (X'X + KI)^{-1} (X'X)\,\theta.$
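
As a numerical illustration of (2.1) and (2.3), the following minimal sketch computes both estimates for a two-column design whose columns are nearly proportional. The data, the value K = 0.1, and the helper names `gram_and_moment`, `ridge_2x2`, and `norm2` are hypothetical and chosen only for this example; the 2 x 2 system is solved in closed form so that no external library is needed.

```python
def gram_and_moment(X, Y):
    """Return the entries of X'X (a, b, d) and X'Y (g0, g1) for a two-column X."""
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    g0 = sum(r[0] * y for r, y in zip(X, Y))
    g1 = sum(r[1] * y for r, y in zip(X, Y))
    return (a, b, d), (g0, g1)


def ridge_2x2(X, Y, K):
    """Solve (X'X + K I) theta = X'Y; K = 0 reproduces least squares (2.1)."""
    (a, b, d), (g0, g1) = gram_and_moment(X, Y)
    a, d = a + K, d + K
    det = a * d - b * b
    return ((d * g0 - b * g1) / det, (a * g1 - b * g0) / det)


def norm2(t):
    """Squared length theta'theta of a two-component estimate."""
    return t[0] ** 2 + t[1] ** 2


# Hypothetical data: the second column nearly duplicates the first.
X = [(1.0, 1.01), (2.0, 1.98), (3.0, 3.02), (4.0, 3.99)]
Y = [2.1, 3.9, 6.2, 7.8]

theta_ls = ridge_2x2(X, Y, 0.0)     # least squares estimate
theta_ridge = ridge_2x2(X, Y, 0.1)  # ridge estimate with K = 0.1
print("least squares:", theta_ls, norm2(theta_ls))
print("ridge        :", theta_ridge, norm2(theta_ridge))
```

The ridge estimate has the smaller squared length, which is the shrinkage that accompanies the inflation of the characteristic roots.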

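Therefore, $\tilde\theta$ is a biased estimator of $\theta$ unless $K = 0$, in
which case $\tilde\theta = \hat\theta$. Let $P$ be an orthogonal matrix
diagonalizing $X'X$, that is

$P X'X P' = D$

where $D$ is a diagonal matrix with the $i$th diagonal element equal to
$\lambda_i$. Let $\alpha = (\alpha_1, \ldots, \alpha_p)' = P\theta$. The mean
squared error is given after simplification by

(2.5)  $E(\tilde\theta - \theta)'(\tilde\theta - \theta) = \sigma^2 \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + K)^2} + K^2 \sum_{i=1}^{p} \frac{\alpha_i^2}{(\lambda_i + K)^2}.$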
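
The expressions (2.2) and (2.5) depend on the design only through the characteristic roots, so they can be evaluated directly. The sketch below does this for hypothetical values of the roots, of $\alpha = P\theta$, and of $\sigma^2$; the function names are illustrative only.

```python
def mse_least_squares(lams, sigma2):
    """Equation (2.2): sigma^2 times the sum of reciprocal characteristic roots."""
    return sigma2 * sum(1.0 / l for l in lams)


def mse_ridge(lams, alphas, sigma2, K):
    """Equation (2.5): a variance term plus a squared-bias term."""
    variance = sigma2 * sum(l / (l + K) ** 2 for l in lams)
    bias2 = K ** 2 * sum(a ** 2 / (l + K) ** 2 for l, a in zip(lams, alphas))
    return variance + bias2


lams = [10.0, 1.0, 0.01]   # hypothetical roots; the small one signals near collinearity
alphas = [1.0, 0.5, 0.5]   # hypothetical alpha = P theta
sigma2 = 1.0

for K in (0.0, 0.05, 0.5, 5.0):
    print(K, mse_ridge(lams, alphas, sigma2, K), mse_least_squares(lams, sigma2))
```

With $K = 0$ the two expressions coincide. Multiplying `alphas` by a large factor makes the squared-bias term dominate, so that for a fixed $K > 0$ the ridge value exceeds the least squares value.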


From (2.2) and (2.5) we see that for any given $K > 0$

MSE $\tilde\theta$ > MSE $\hat\theta$

for sufficiently large values of $\theta'\theta$. Therefore, the ridge estimator
can be compared favorably with the least squares estimator only if $\theta$ is
constrained. Suppose it is known a priori that $\theta'\theta \le c$, where $c$
is a positive number. This condition would be realized in many practical
situations. Since $\theta'\theta = \alpha'\alpha$, we have $\alpha_i^2 \le c$,
$i = 1, \ldots, p$. Hence from (2.5) we get

(2.6)  $\mathrm{MSE}\,\tilde\theta \le \sigma^2 \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + K)^2} + cK^2 \sum_{i=1}^{p} \frac{1}{(\lambda_i + K)^2}.$


Theorems  2.1,  2.2,  and  2.3  below  give  certain  results  on  the  choice  of 
K  in  order  that  the  ridge  estimator  has  smaller  mean  squared  error  than  the 
least  squares  estimator. 

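Theorem 2.1. If $\theta'\theta \le c$ then MSE $\tilde\theta$ < MSE $\hat\theta$
for $0 < K \le 2\sigma^2/c$.

Proof: For $K \le 2\sigma^2/c$ we have $cK^2 \le 2\sigma^2 K$, so from (2.6)

$\mathrm{MSE}\,\tilde\theta \le \sigma^2 \sum_{i=1}^{p} \left[ \frac{\lambda_i}{(\lambda_i + K)^2} + \frac{2K}{(\lambda_i + K)^2} \right]$

$= \sigma^2 \sum_{i=1}^{p} \frac{\lambda_i + 2K}{(\lambda_i + K)^2}$

$< \sigma^2 \sum_{i=1}^{p} \frac{1}{\lambda_i}$

$= \mathrm{MSE}\,\hat\theta,$

where the strict inequality holds because $\lambda_i(\lambda_i + 2K) < (\lambda_i + K)^2$ for $K > 0$.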
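
The bound in Theorem 2.1 can be checked numerically. The sketch below evaluates the right hand side of (2.6) on a grid of $K$ in $(0, 2\sigma^2/c]$ and compares it with (2.2); the roots, $\sigma^2$, and $c$ are hypothetical values chosen for illustration.

```python
def ls_risk(lams, sigma2):
    """Equation (2.2)."""
    return sigma2 * sum(1.0 / l for l in lams)


def ridge_bound(lams, sigma2, c, K):
    """Right hand side of (2.6), an upper bound on the ridge risk when theta'theta <= c."""
    return sum((sigma2 * l + c * K ** 2) / (l + K) ** 2 for l in lams)


lams = [5.0, 0.5, 0.02]      # hypothetical characteristic roots
sigma2, c = 1.0, 4.0         # hypothetical error variance and constraint
K_max = 2.0 * sigma2 / c     # upper end of the range in Theorem 2.1

grid = [K_max * j / 50 for j in range(1, 51)]
holds = all(ridge_bound(lams, sigma2, c, K) < ls_risk(lams, sigma2) for K in grid)
print("K_max =", K_max, "bound below least squares risk on grid:", holds)
```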


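Theorem 2.2. If $\theta'\theta \le \frac{\sigma^2}{p} \sum_{i=1}^{p} \frac{1}{\lambda_i}$
then MSE $\tilde\theta$ < MSE $\hat\theta$ for $K > 0$.

Proof: Let $D(K)$ denote the quantity on the right hand side of (2.6).
Differentiating $D(K)$ with respect to $K$ we have

(2.7)  $\partial D(K)/\partial K = \sum_{i=1}^{p} \frac{2\lambda_i (cK - \sigma^2)}{(\lambda_i + K)^3}.$

The right hand side of (2.7) is equal to zero for $K = \sigma^2/c$ and is
$<(>)\ 0$ for $K <(>)\ \sigma^2/c$. Therefore, $D(K)$ is first decreasing then
increasing as $K$ varies from 0 to $\infty$. Now

$D(0) = \sigma^2 \sum_{i=1}^{p} \frac{1}{\lambda_i} = \mathrm{MSE}\,\hat\theta,$

$D(\infty) = pc.$

Hence, for $0 < K < \infty$,

$D(K) < \max(pc, \mathrm{MSE}\,\hat\theta) = \mathrm{MSE}\,\hat\theta$ for $c \le \frac{\sigma^2}{p} \sum_{i=1}^{p} \frac{1}{\lambda_i}.$

Since $D(K)$ is an upper bound on the value of MSE $\tilde\theta$, the theorem follows. []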
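
The shape of $D(K)$ used in this proof can be examined numerically. The following sketch, with hypothetical roots and $\sigma^2$, takes $c$ at the largest value allowed by Theorem 2.2, evaluates the derivative formula (2.7), and checks on a logarithmic grid that $D(K)$ stays below the least squares risk.

```python
def D(lams, sigma2, c, K):
    """Right hand side of (2.6) as a function of K."""
    return sum((sigma2 * l + c * K ** 2) / (l + K) ** 2 for l in lams)


def dD_dK(lams, sigma2, c, K):
    """Derivative of D with respect to K, as in (2.7)."""
    return sum(2.0 * l * (c * K - sigma2) / (l + K) ** 3 for l in lams)


lams = [5.0, 0.5, 0.02]                 # hypothetical characteristic roots
sigma2 = 1.0
p = len(lams)
ls_risk = sigma2 * sum(1.0 / l for l in lams)
c = ls_risk / p                          # largest c permitted by Theorem 2.2
K_star = sigma2 / c                      # stationary point of D

grid = [10.0 ** (e / 4.0) for e in range(-16, 25)]   # K from 1e-4 to 1e6
below = all(D(lams, sigma2, c, K) < ls_risk for K in grid)
print("K* =", K_star, "D(K*) =", D(lams, sigma2, c, K_star))
print("least squares risk =", ls_risk, "D below it on grid:", below)
```

The derivative is negative to the left of $K^* = \sigma^2/c$ and positive to the right, so $K^*$ minimizes the bound, although not necessarily the risk itself.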

From a Bayesian point of view, suppose that the components of $\theta$ are
independently and identically distributed with mean $\xi$ and variance $\tau^2$.

Theorem 2.3. If the components of $\theta$ are independently and identically
distributed with mean $\xi$ and variance $\tau^2$ then the average mean squared
error of the ridge estimator is minimized for $K = \sigma^2/(\xi^2 + \tau^2)$.

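Proof: Let $E_\theta$ denote expectation with respect to the given prior
distribution of $\theta$. We have

(2.8)  $E_\theta(\mathrm{MSE}\,\tilde\theta) = \sigma^2 \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + K)^2} + K^2 (\xi^2 + \tau^2) \sum_{i=1}^{p} \frac{1}{(\lambda_i + K)^2}.$

As in the proof of Theorem 2.2 we find that the right hand side of (2.8) is
minimized for $K = \sigma^2/(\xi^2 + \tau^2)$. []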
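
The minimizing value in Theorem 2.3 can be compared with a direct grid search. The sketch below takes the right hand side of (2.8) as given and minimizes it over a grid of $K$; the roots and the values of $\sigma^2$, $\xi$, and $\tau^2$ are hypothetical.

```python
def average_risk(lams, sigma2, xi, tau2, K):
    """Right hand side of (2.8)."""
    m2 = xi ** 2 + tau2
    return sum((sigma2 * l + m2 * K ** 2) / (l + K) ** 2 for l in lams)


lams = [5.0, 0.5, 0.02]               # hypothetical characteristic roots
sigma2, xi, tau2 = 1.0, 0.5, 0.75     # hypothetical variance and prior moments
K_opt = sigma2 / (xi ** 2 + tau2)     # value stated in Theorem 2.3

grid = [j / 1000.0 for j in range(0, 5001)]   # K from 0 to 5 in steps of 0.001
K_grid = min(grid, key=lambda K: average_risk(lams, sigma2, xi, tau2, K))
print("stated minimizer:", K_opt, "grid minimizer:", K_grid)
```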

We have seen above that the ridge estimator is preferred to the least squares
estimator in certain situations when the parameter $\theta$ is constrained. The
following theorem gives a basis for the choice of the ridge estimator under the
given constraint.

Theorem 2.4. The value of $\theta$ minimizing $R(\theta) = (Y - X\theta)'(Y - X\theta)$,
given $\theta'\theta \le c$, is equal to $\tilde\theta$, where $K$ is chosen
such that $\tilde\theta'\tilde\theta = c$.

Proof: By direct computation we get

(2.9)  $\tilde\theta'\tilde\theta = (PX'Y)'(D + KI)^{-2}(PX'Y).$

It is seen from (2.9) that $\tilde\theta'\tilde\theta$ is decreasing in $K$.
Therefore, the value of $K$ given by $\tilde\theta'\tilde\theta = c$ is uniquely
determined.

We have

(2.10)  $R(\tilde\theta) = (Y - X\tilde\theta)'(Y - X\tilde\theta)$

$= (Y - X\hat\theta)'(Y - X\hat\theta) + (X'Y)'[(X'X + KI)^{-1} - (X'X)^{-1}]\, X'X\, [(X'X + KI)^{-1} - (X'X)^{-1}](X'Y)$

$= (Y - X\hat\theta)'(Y - X\hat\theta) + (PX'Y)'\, D^{*}\, (PX'Y)$

where $D^{*}$ is a $p \times p$ diagonal matrix whose $i$th diagonal element is
equal to

$\frac{K^2}{\lambda_i (K + \lambda_i)^2}.$

From (2.10) we see that $R(\tilde\theta)$ is increasing in $K$.


Consider the problem of minimizing $R(\theta)$ with respect to $\theta$ under
the constraint $\theta'\theta = c$. By the Lagrangian method the minimizing
value of $\theta$ is given by

$\lambda\theta - X'(Y - X\theta) = 0$

or

$\theta = (X'X + \lambda I)^{-1} X'Y$

where the multiplier $\lambda$ is determined such that $\theta'\theta = c$. That
is, the minimizing value of $\theta$ is the ridge estimator $\tilde\theta$,
where $K$ is determined such that $\tilde\theta'\tilde\theta = c$.

We have shown above that $\tilde\theta'\tilde\theta$ is decreasing in $K$ and
that $R(\tilde\theta)$ is increasing in $K$. It follows that $\tilde\theta$,
which minimizes $R(\theta)$ under the constraint $\theta'\theta = c$, also
minimizes $R(\theta)$ under the constraint $\theta'\theta \le c$. []
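
Because $\tilde\theta'\tilde\theta$ in (2.9) is decreasing in $K$, the value of $K$ satisfying $\tilde\theta'\tilde\theta = c$ can be located by bisection. The sketch below works in the rotated coordinates, with `z` standing for $PX'Y$ and `lams` for the diagonal of $D$; the numerical values, the tolerance, and the function names are hypothetical, and returning $K = 0$ when the least squares estimate already satisfies the constraint is a convention adopted for this example.

```python
def ridge_norm2(z, lams, K):
    """theta~'theta~ from (2.9), with z = P X'Y and lams the diagonal of D."""
    return sum((zi / (l + K)) ** 2 for zi, l in zip(z, lams))


def solve_K(z, lams, c, tol=1e-12):
    """Bisection for K >= 0 with theta~'theta~ = c; returns 0 if K = 0 already meets it."""
    if ridge_norm2(z, lams, 0.0) <= c:
        return 0.0
    lo, hi = 0.0, 1.0
    while ridge_norm2(z, lams, hi) > c:   # expand until the squared length drops below c
        hi *= 2.0
    while hi - lo > tol * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        if ridge_norm2(z, lams, mid) > c:
            lo = mid                       # still too long: K must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)


z = [3.0, 1.0, 0.2]          # hypothetical P X'Y
lams = [5.0, 0.5, 0.02]      # hypothetical characteristic roots
c = 4.0                      # hypothetical bound on theta'theta

K = solve_K(z, lams, c)
print("K =", K, "squared length =", ridge_norm2(z, lams, K))
```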

Remark: We have a comparison between least squares estimation and ridge
estimation. The ridge estimator is given by minimizing $R(\theta)$ under a
certain constraint on the value of $\theta'\theta$, whereas the least squares
estimator is given by minimizing $R(\theta)$ without that constraint.

Throughout the foregoing discussion we have assumed that the quantity $K$
arising in the definition of the ridge estimator $\tilde\theta$ is a scalar
constant. By letting $K$ depend suitably on the observation $Y$, we might be
able to obtain an estimator which has a smaller MSE than the least squares
estimator for all values of $\theta$. Hoerl and Kennard (1970, a) have suggested
an iterative method of choosing such a value of $K$. However, they did not show
that the final estimator had a smaller MSE than the least squares estimator. On
the other hand (see Alam (1974)), any estimator of the form

$\phi\big( Y'X(X'X)^{-1}X'Y / \sigma^2 \big)\, \hat\theta$

has smaller MSE than $\hat\theta$, where $\phi(Z)$ is a function such that
$Z(1 - \phi(Z))$ is nondecreasing in $Z$ and $0 \le Z(1 - \phi(Z)) \le 2ap - 4$ and

[...] $\min(\lambda_1, \ldots, \lambda_p)$.


See  also,  Sclove  (1963)  and  Stein  (1960). 